Index

  1. Univariate Analysis for 1 distinct numeric variable
  2. Univariate Analysis for 1 distinct categorical variable
  3. Bivariate Analysis for 1 distinct pair of variables, where both are numeric
  4. Bivariate Analysis for 1 distinct pair of variables, where one variable is categorical and the other is numeric

Data set: New York City Airbnb Open Data

Source Link: https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data

Summary information about the data: Since 2008, guests and hosts have used Airbnb to expand on traveling possibilities and present a more unique, personalized way of experiencing the world. This dataset describes the listing activity and metrics in NYC, NY for 2019. This data file includes all needed information to find out more about hosts, geographical availability, and the metrics necessary to make predictions and draw conclusions. This public dataset is part of Airbnb, and the original source can be found on this website.

This dataset has 48895 observations of 16 variables.

The variables in the dataset are:

id : Listing ID

name : Listing Name

host_id : Host ID

host_name : Name of the Host

neighbourhood_group : Location

neighbourhood : Area

latitude : Latitude Coordinates

longitude : Longitude Coordinates

room_type : Listing Space Type

price : Price in Dollars

minimum_nights : Minimum number of nights

number_of_reviews : Number of Reviews

last_review : Latest Review

reviews_per_month : Number of Reviews per Month

calculated_host_listings_count : Number of listings per host

availability_365 : Number of days when the listing is available for booking

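The variable types in the dataset are listed below. Note that read_csv, used further down, parses the text columns as character, the numeric columns as double, and last_review as a date, rather than as factor and integer.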

id : Integer

name : Factor

host_id : Integer

host_name : Factor

neighbourhood_group : Factor

neighbourhood : Factor

latitude : Numeric

longitude : Numeric

room_type : Factor

price : Integer

minimum_nights : Integer

number_of_reviews : Integer

last_review : Factor

reviews_per_month : Numeric

calculated_host_listings_count : Integer

availability_365 : Integer

Loading Packages

library(tidyverse)
library(here)
library(ggplot2)
library(gridExtra)

We load the Airbnb data with read_csv, using the here function to build the path to the file.

airbnb <- read_csv(here("Data", "AB_NYC_2019.csv"))
## Parsed with column specification:
## cols(
##   id = col_double(),
##   name = col_character(),
##   host_id = col_double(),
##   host_name = col_character(),
##   neighbourhood_group = col_character(),
##   neighbourhood = col_character(),
##   latitude = col_double(),
##   longitude = col_double(),
##   room_type = col_character(),
##   price = col_double(),
##   minimum_nights = col_double(),
##   number_of_reviews = col_double(),
##   last_review = col_date(format = ""),
##   reviews_per_month = col_double(),
##   calculated_host_listings_count = col_double(),
##   availability_365 = col_double()
## )

Data exploration

Displaying the Airbnb dataset

airbnb

Structure and features

str(airbnb)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 48895 obs. of  16 variables:
##  $ id                            : num  2539 2595 3647 3831 5022 ...
##  $ name                          : chr  "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
##  $ host_id                       : num  2787 2845 4632 4869 7192 ...
##  $ host_name                     : chr  "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
##  $ neighbourhood_group           : chr  "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
##  $ neighbourhood                 : chr  "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
##  $ latitude                      : num  40.6 40.8 40.8 40.7 40.8 ...
##  $ longitude                     : num  -74 -74 -73.9 -74 -73.9 ...
##  $ room_type                     : chr  "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
##  $ price                         : num  149 225 150 89 80 200 60 79 79 150 ...
##  $ minimum_nights                : num  1 1 3 1 10 3 45 2 2 1 ...
##  $ number_of_reviews             : num  9 45 0 270 9 74 49 430 118 160 ...
##  $ last_review                   : Date, format: "2018-10-19" "2019-05-21" ...
##  $ reviews_per_month             : num  0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
##  $ calculated_host_listings_count: num  6 2 1 1 1 1 1 1 1 4 ...
##  $ availability_365              : num  365 355 365 194 0 129 0 220 0 188 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   name = col_character(),
##   ..   host_id = col_double(),
##   ..   host_name = col_character(),
##   ..   neighbourhood_group = col_character(),
##   ..   neighbourhood = col_character(),
##   ..   latitude = col_double(),
##   ..   longitude = col_double(),
##   ..   room_type = col_character(),
##   ..   price = col_double(),
##   ..   minimum_nights = col_double(),
##   ..   number_of_reviews = col_double(),
##   ..   last_review = col_date(format = ""),
##   ..   reviews_per_month = col_double(),
##   ..   calculated_host_listings_count = col_double(),
##   ..   availability_365 = col_double()
##   .. )

Brief Summary of dataset

summary(airbnb)
##        id               name              host_id           host_name        
##  Min.   :    2539   Length:48895       Min.   :     2438   Length:48895      
##  1st Qu.: 9471945   Class :character   1st Qu.:  7822033   Class :character  
##  Median :19677284   Mode  :character   Median : 30793816   Mode  :character  
##  Mean   :19017143                      Mean   : 67620011                     
##  3rd Qu.:29152178                      3rd Qu.:107434423                     
##  Max.   :36487245                      Max.   :274321313                     
##                                                                              
##  neighbourhood_group neighbourhood         latitude       longitude     
##  Length:48895        Length:48895       Min.   :40.50   Min.   :-74.24  
##  Class :character    Class :character   1st Qu.:40.69   1st Qu.:-73.98  
##  Mode  :character    Mode  :character   Median :40.72   Median :-73.96  
##                                         Mean   :40.73   Mean   :-73.95  
##                                         3rd Qu.:40.76   3rd Qu.:-73.94  
##                                         Max.   :40.91   Max.   :-73.71  
##                                                                         
##   room_type             price         minimum_nights    number_of_reviews
##  Length:48895       Min.   :    0.0   Min.   :   1.00   Min.   :  0.00   
##  Class :character   1st Qu.:   69.0   1st Qu.:   1.00   1st Qu.:  1.00   
##  Mode  :character   Median :  106.0   Median :   3.00   Median :  5.00   
##                     Mean   :  152.7   Mean   :   7.03   Mean   : 23.27   
##                     3rd Qu.:  175.0   3rd Qu.:   5.00   3rd Qu.: 24.00   
##                     Max.   :10000.0   Max.   :1250.00   Max.   :629.00   
##                                                                          
##   last_review         reviews_per_month calculated_host_listings_count
##  Min.   :2011-03-28   Min.   : 0.010    Min.   :  1.000               
##  1st Qu.:2018-07-08   1st Qu.: 0.190    1st Qu.:  1.000               
##  Median :2019-05-19   Median : 0.720    Median :  1.000               
##  Mean   :2018-10-04   Mean   : 1.373    Mean   :  7.144               
##  3rd Qu.:2019-06-23   3rd Qu.: 2.020    3rd Qu.:  2.000               
##  Max.   :2019-07-08   Max.   :58.500    Max.   :327.000               
##  NA's   :10052        NA's   :10052                                   
##  availability_365
##  Min.   :  0.0   
##  1st Qu.:  0.0   
##  Median : 45.0   
##  Mean   :112.8   
##  3rd Qu.:227.0   
##  Max.   :365.0   
## 

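— The missing rows for “last_review” and “reviews_per_month” are the same (10052 in each column), which makes sense: presumably a listing without reviews has neither a last review date nor a monthly review rate.

— We conclude that these missing values do not need to be treated manually.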
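The claim that the two columns are missing in the same rows can be checked directly rather than inferred from matching counts. The following is a minimal sketch on a made-up toy data frame (the `toy` object and its values are hypothetical, not taken from the dataset); the same two checks could be run on `airbnb`.

```r
# Toy stand-in for the airbnb tibble; the values are invented for illustration.
toy <- data.frame(
  number_of_reviews = c(9, 0, 45, 0),
  last_review       = as.Date(c("2018-10-19", NA, "2019-05-21", NA)),
  reviews_per_month = c(0.21, NA, 0.38, NA)
)

# TRUE where exactly one of the two columns is missing (a mismatch).
mismatch <- is.na(toy$last_review) != is.na(toy$reviews_per_month)
n_mismatch <- sum(mismatch)
n_mismatch

# Assumption being tested: rows with missing review fields have zero reviews.
missing_have_no_reviews <- all(toy$number_of_reviews[is.na(toy$reviews_per_month)] == 0)
missing_have_no_reviews
```

A mismatch count of zero means the missing values line up row by row, which is a stronger statement than the two columns merely having the same number of NA's.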

— The following columns can be omitted because they do not carry information that is useful for our analysis, so we will not use them:

  • name
  • id
  • host_id
  • host_name
  • last_review
  • neighbourhood

— We will also convert room_type and neighbourhood_group to factors, because they are categorical variables but read_csv reads them as character columns.

airbnb_data <- airbnb %>% 
  select(-id, -name, -host_id, -host_name, -last_review, -neighbourhood) %>% 
  mutate(room_type = factor(room_type), neighbourhood_group = factor(neighbourhood_group))
airbnb_data
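When several character columns need the same conversion, they can be converted in one step. This is a sketch only, on a hypothetical `toy` tibble, and it assumes dplyr 1.0 or later for `across()`; it converts every character column, which is broader than the two columns converted above.

```r
library(dplyr)

# Hypothetical toy data with the same column names as the report.
toy <- tibble(
  room_type           = c("Private room", "Entire home/apt"),
  neighbourhood_group = c("Brooklyn", "Manhattan"),
  price               = c(149, 225)
)

# Convert all character columns to factors; numeric columns are untouched.
toy_factors <- toy %>%
  mutate(across(where(is.character), factor))

sapply(toy_factors, class)
```

Naming the columns explicitly, as in the report, is the safer choice when free-text columns such as name are still present.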

— The price variable is hard to work with directly because it depends on minimum_nights, so we derive a price per night from the price and minimum_nights variables.

airbnb_clean <- airbnb_data %>% mutate(price_per_night = price/minimum_nights)
airbnb_clean

Exercise 1: Create an appropriate plot to visualize the distribution of this variable.

Creating a histogram for the price_per_night variable, as it is a distinct numeric variable

ggplot(airbnb_clean, aes(x = price_per_night)) + 
  geom_histogram(fill = 'skyblue', colour = 'black') + ggtitle("Distribution of Price")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

— Finding the maximum and minimum values of price_per_night to obtain its range

Minimum price_per_night

print(min(airbnb_clean$price_per_night))
## [1] 0

Maximum price_per_night

print(max(airbnb_clean$price_per_night))
## [1] 8000

Range of price_per_night

price_per_night_range = max(airbnb_clean$price_per_night) - min(airbnb_clean$price_per_night)
price_per_night_range
## [1] 8000

Dividing the price_per_night range by 30 gives an approximate value for the default binwidth used by geom_histogram, which defaults to 30 bins

default_bin = price_per_night_range/30
default_bin
## [1] 266.6667
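Because the range is dominated by a single maximum of 8000, a bin width derived from the range puts almost all listings into the first bin. One common alternative, which is not the method used in this report, is the Freedman-Diaconis rule, where the bin width is 2 * IQR / n^(1/3). The sketch below plugs in the number of observations and the interquartile range reported elsewhere in this document; the variable names are illustrative.

```r
# Values taken from this report: number of observations and the IQR of price_per_night.
n   <- 48895
iqr <- 61.5

# Range-based width, as computed above (range of 8000 split into 30 parts).
range_binwidth <- 8000 / 30

# Freedman-Diaconis rule: depends on the IQR, so it is not driven by extreme values.
fd_binwidth <- 2 * iqr / n^(1 / 3)

c(range_based = range_binwidth, freedman_diaconis = fd_binwidth)
```

The Freedman-Diaconis width is far smaller than the range-based one, so it would mainly be useful together with a restricted x-axis or after the outliers have been removed.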

Creating a histogram for the price_per_night variable, as it is a distinct numeric variable, with a proper binwidth

ggplot(airbnb_clean, aes(price_per_night)) + 
  geom_histogram(fill = 'skyblue', colour = 'black', binwidth = 267)

Exercise 2: Consider any outliers present in the data. If present, specify the criteria used to identify them and provide a logical explanation for how you handled them.

summary(airbnb_clean$price_per_night)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   20.00   44.50   70.17   81.50 8000.00
quantile(airbnb_clean$price_per_night,seq(0,1,by=0.1))
##          0%         10%         20%         30%         40%         50% 
##    0.000000    6.166667   15.000000   24.500000   33.333333   44.500000 
##         60%         70%         80%         90%        100% 
##   55.500000   72.500000   96.666667  140.000000 8000.000000

— According to the dataset, the average price is about 70 dollars per night in New York City. For comparison, an internet search for the average price of New York City hotel rooms returns the following:

— Search results for moderate New York hotel rooms: “New York City hotel room rates start at under $300 a night. This is the average price of a New York City hotel that is well located and full-service.”

— Link for the above statement: https://www.google.com/search?client=firefox-b-d&q=average+room+price+in+new+york+city+per+night

quantile(airbnb_clean$price_per_night,seq(0.0,1,by=0.01))
##          0%          1%          2%          3%          4%          5% 
##    0.000000    1.233333    1.637702    2.142857    2.741649    3.333333 
##          6%          7%          8%          9%         10%         11% 
##    3.928571    4.433333    5.000000    5.628833    6.166667    6.666667 
##         12%         13%         14%         15%         16%         17% 
##    7.428571    8.100000    8.972000   10.000000   10.714286   11.666667 
##         18%         19%         20%         21%         22%         23% 
##   12.800000   13.800000   15.000000   16.000000   16.666667   17.800000 
##         24%         25%         26%         27%         28%         29% 
##   18.750000   20.000000   20.000000   21.500000   22.500000   23.333333 
##         30%         31%         32%         33%         34%         35% 
##   24.500000   25.000000   25.600000   26.666667   28.000000   29.000000 
##         36%         37%         38%         39%         40%         41% 
##   30.000000   30.000000   32.000000   32.500000   33.333333   34.500000 
##         42%         43%         44%         45%         46%         47% 
##   35.000000   36.666667   37.500000   38.500000   40.000000   40.000000 
##         48%         49%         50%         51%         52%         53% 
##   41.666667   42.500000   44.500000   45.000000   46.666667   48.333333 
##         54%         55%         56%         57%         58%         59% 
##   49.500000   50.000000   50.000000   50.000000   52.500000   55.000000 
##         60%         61%         62%         63%         64%         65% 
##   55.500000   58.000000   60.000000   60.000000   61.666667   63.000000 
##         66%         67%         68%         69%         70%         71% 
##   65.000000   66.630667   68.500000   70.000000   72.500000   75.000000 
##         72%         73%         74%         75%         76%         77% 
##   75.000000   76.500000   80.000000   81.500000   85.000000   87.500000 
##         78%         79%         80%         81%         82%         83% 
##   90.000000   92.500000   96.666667   99.500000  100.000000  100.000000 
##         84%         85%         86%         87%         88%         89% 
##  105.000000  111.000000  116.666667  122.500000  125.000000  131.666667 
##         90%         91%         92%         93%         94%         95% 
##  140.000000  150.000000  150.000000  169.000000  180.000000  200.000000 
##         96%         97%         98%         99%        100% 
##  220.000000  250.000000  300.000000  443.000000 8000.000000

— We can see that 99% of the data lies at or below 443 dollars per night, as shown in the detailed output of the quantile function above. The remaining 1% of the data, between the 99th and 100th percentiles, is therefore considered outliers. Properties listed at 0 dollars per night are also considered outliers, since they would be free properties.

To remove outliers

# Keep values at or below the 99th percentile (443 dollars per night)
x <- airbnb_clean %>% select(price_per_night) %>% filter(price_per_night <= 443)
# Drop the listings priced at 0 dollars per night
y <- x %>% filter(price_per_night > 0)
y
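The cut-off can also be computed from the data instead of being typed in by hand, which avoids transcription slips. This is a minimal sketch on simulated prices (the `toy` tibble is hypothetical and only imitates a skewed price distribution with one free listing and one extreme value).

```r
library(dplyr)

set.seed(1)
# Hypothetical toy prices: one free listing, a skewed bulk, and one extreme value.
toy <- tibble(price_per_night = c(0, rexp(998, rate = 1 / 70), 8000))

# 99th percentile computed from the data rather than hard-coded.
upper <- quantile(toy$price_per_night, probs = 0.99, names = FALSE)

# Same two criteria as in the text: above 0 and at or below the 99th percentile.
trimmed <- toy %>%
  filter(price_per_night > 0, price_per_night <= upper)

nrow(toy) - nrow(trimmed)   # number of rows dropped
```

With the real data, replacing `toy` by `airbnb_clean` would apply the same two criteria described above.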

Exercise 3: Describe the shape and skewness of the distribution.

— To find the shape and skewness of price_per_night

Average

avg_price_per_night = mean(airbnb_clean$price_per_night)
avg_price_per_night
## [1] 70.17425
ggplot(airbnb_clean, aes(x = price_per_night)) + 
  geom_histogram(fill = 'skyblue', colour = 'black', binwidth = 267) +
  geom_vline(xintercept = avg_price_per_night, linetype = "dashed", colour = 'red', size = 1) +
  ggtitle("Count of Price Per Night Showing Its Average Value")

med_price_per_night = median(airbnb_clean$price_per_night)
med_price_per_night
## [1] 44.5
ggplot(airbnb_clean, aes(x = price_per_night)) + 
  geom_histogram(fill = 'skyblue', colour = 'black', binwidth = 267) +
  geom_vline(xintercept = med_price_per_night, linetype = "dashed", colour = 'blue', size = 1) +
  ggtitle("Count of Price Per Night Showing Its Median Value")

ggplot(airbnb_clean, aes(x = price_per_night)) + 
  geom_histogram(fill = 'skyblue',colour = 'black', binwidth = 267) + 
  geom_vline(xintercept = avg_price_per_night, linetype = "dashed", colour = 'red', size = 1) + 
  geom_vline(xintercept = med_price_per_night, linetype = "dashed", colour = 'blue', size = 1) +
  ggtitle("Count of Price Per Night Comparing Its Mean and Median Values")

— The above plot shows that the mean (the average price_per_night) is greater than the median, which is what we expect for a right-skewed distribution. The shape is unimodal.
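Comparing mean and median gives the direction of the skew; a skewness coefficient puts a number on it. Base R has no built-in skewness function, so the sketch below defines a small helper (the name `skewness` and the toy vectors are illustrative) using the moment-based formula: the third central moment divided by the second central moment raised to the power 1.5.

```r
# Moment-based sample skewness: positive for right skew, about 0 for symmetry.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Toy vectors for illustration only.
right_skewed <- c(1, 2, 2, 3, 3, 3, 4, 50)
symmetric    <- c(1, 2, 3, 4, 5)

skewness(right_skewed)   # positive: long right tail
skewness(symmetric)      # zero: balanced around the mean
```

Applied to `airbnb_clean$price_per_night`, a clearly positive value would agree with the mean lying above the median.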

Exercise 4: Based on your answer to the previous question, decide if it is appropriate to apply a transformation to your data. If no, explain why not. If yes, name the transformation applied and visualize the transformed distribution.

p1 <- ggplot(airbnb_clean, aes(x = price_per_night)) + 
  geom_histogram(fill = 'skyblue', colour = 'black', binwidth = 267)

p2 <- ggplot(airbnb_clean, aes(x = price_per_night)) + 
  geom_histogram(fill = 'skyblue', colour = 'black') + scale_x_log10() + xlab("Log10 of Price")

p3 <- ggplot(airbnb_clean, aes(x = price_per_night)) + 
  geom_histogram(fill = 'skyblue', colour = 'black') + scale_x_sqrt() + xlab("Square Root of Price")

grid.arrange(p1, p2, p3)

— Yes. I applied the log10 transformation because the data is highly right-skewed.
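One detail worth keeping in mind with a log transformation is that the minimum price_per_night is 0, and log10(0) is -Inf, so such values cannot be placed on a log axis. The sketch below uses a small made-up price vector (not taken from the dataset) to show this and to compare the gap between mean and median before and after the transformation.

```r
# Hypothetical prices, including a free listing and one extreme value.
prices <- c(0, 5, 20, 45, 80, 140, 8000)

log10(prices)[1]          # -Inf: a zero price has no finite log

# Keep strictly positive prices before transforming.
positive   <- prices[prices > 0]
log_prices <- log10(positive)

# Gap between mean and median, on the raw and on the log10 scale.
raw_gap <- mean(positive) - median(positive)
log_gap <- mean(log_prices) - median(log_prices)
c(raw = raw_gap, log10 = log_gap)
```

On the log10 scale the mean and median sit much closer together, which is the sense in which the transformation reduces the skew.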

Exercise 5: Choose and calculate an appropriate measure of central tendency.

The mean value is:

avg_price_per_night = mean(airbnb_clean$price_per_night)
avg_price_per_night
## [1] 70.17425

The median value is:

med_price_per_night = median(airbnb_clean$price_per_night)
med_price_per_night
## [1] 44.5

Relative difference between the mean and the median (as a proportion of the median):

(avg_price_per_night - med_price_per_night) / med_price_per_night
## [1] 0.5769494

— The mean is greater than the median, which confirms the skewness noted in the histogram.

— So we will take the median as the measure of central tendency.

Exercise 6: Explain why you chose this as your measure of central tendency. Provide supporting evidence for your choice.

— The mean is greater than the median, which confirms the skewness noted in the histogram.

— It would be better to use the median as a measure of central tendency given that the distribution is skewed. We know that the median is more robust to outliers than the mean so this should give us a better sense of the centre of the data distribution.

Exercise 7: Choose and calculate a measure of spread that is appropriate for your chosen measure of central tendency. Explain why you chose this as your measure of spread.

We should use the interquartile range as a measure of spread, since it is more robust than the standard deviation in the presence of skewness and outliers.

— The interquartile range is:

IQR(airbnb_clean$price_per_night)
## [1] 61.5
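The interquartile range can be cross-checked against the quartiles printed by summary() earlier (1st Qu. 20.00, 3rd Qu. 81.50). The sketch below also derives Tukey's 1.5 * IQR fences from the same numbers; that rule is a common convention and is not the outlier criterion used in this report, which relied on the 99th percentile.

```r
# Quartiles of price_per_night as printed by summary() in this report.
q1 <- 20.0
q3 <- 81.5

# IQR is the distance between the third and first quartiles.
iqr <- q3 - q1

# Tukey's fences (a swapped-in convention, not the report's criterion).
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

c(iqr = iqr, lower_fence = lower_fence, upper_fence = upper_fence)
```

The computed IQR matches the value returned by IQR() above, and the upper fence is considerably stricter than the 443 dollar cut-off.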

For Categorical Variable:

Exercise 1: Create an appropriate plot to visualize the distribution of counts for this variable.

— Counting neighbourhood_group manually

airbnb_clean %>% count(neighbourhood_group)
ggplot(airbnb, aes(x = neighbourhood_group)) +
  geom_bar(stat = "count", fill = 'orange') +
  geom_text(aes(label = ..count..), stat = "count", position = position_stack()) +
  ggtitle("The count of properties listed in neighbourhood_group")

Exercise 2: Create an appropriate plot to visualize the distribution of proportions for this variable.

— Visualizing neighbourhood_group variable’s distribution of proportions.

ggplot(airbnb_clean, aes(x = neighbourhood_group, y = ..prop.., group = 1)) +
  geom_bar(fill = "orange", color = "black", stat = "count") + ggtitle("Distribution of neighbourhood_group according to their Proportions")

— Manually calculating these proportions and verifying that the results are the same as what is shown in the previous bar plot.

airbnb_clean %>% group_by(neighbourhood_group) %>%
  summarise(n = n()) %>%
  mutate(prop = n / sum(n))
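The same proportions can be obtained in base R with table() and prop.table(), which is a handy cross-check on the dplyr pipeline. This is a sketch on a short made-up vector of borough names, not the real column.

```r
# Hypothetical vector standing in for the neighbourhood_group column.
boroughs <- c("Manhattan", "Brooklyn", "Manhattan",
              "Queens", "Brooklyn", "Manhattan")

# Counts per category, then each count divided by the total.
counts <- table(boroughs)
props  <- prop.table(counts)

props
sum(props)   # proportions always add up to 1
```

On the real data, `prop.table(table(airbnb_clean$neighbourhood_group))` should agree with the `prop` column computed above.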

Exercise 3: Discuss any unusual observations for this variable?

— Manhattan has the largest number of properties, while Staten Island has the smallest number of properties listed on Airbnb.

Exercise 4: Discuss if there are too few/too many unique values?

Descending

airbnb_clean %>% group_by(neighbourhood_group) %>%
  summarise(n = n()) %>%
  mutate(prop = n / sum(n)) %>%
  arrange(desc(n))

Ascending

airbnb_clean %>% group_by(neighbourhood_group) %>%
  summarise(n = n()) %>%
  mutate(prop = n / sum(n)) %>%
  arrange(n)

— From the values above, Staten Island has the smallest number of properties listed on Airbnb, while Manhattan has the largest.

Bivariate Analysis

1 distinct pair of numeric variables:

Exercise 1: Create an appropriate plot to visualize the relationship between the two variables.

ggplot(airbnb_clean, aes(x = number_of_reviews, y = price)) +
  geom_point(aes(size = price), alpha = 0.05, color = "slateblue") + 
  xlab("Number of reviews") +
  ylab("Price") +
  ggtitle("Relationship between number of reviews and price",
          subtitle = "The most expensive objects have a small number of reviews (or 0)")

Exercise 2: Describe the form, direction, and strength of the observed relationship. Include both qualitative and quantitative measures, as appropriate.

— Let’s reproduce the above plot but with a best fit line:

ggplot(airbnb_clean, aes(x = number_of_reviews, y = price)) +
  geom_point(aes(size = price), alpha = 0.05, color = "slateblue") + 
  geom_smooth(method = 'lm', se=FALSE, color = "black") +
  xlab("Number of reviews") +
  ylab("Price") +
  ggtitle("Relationship between number of reviews and price",
          subtitle = "The most expensive objects have a small number of reviews (or 0)")

— It appears that Price has a negative, linear relation to the Number of Reviews (or at least could potentially be modeled as such) but it appears to be somewhat of a weak relationship.

— The value of the correlation coefficient is on the weaker side, confirming what we see in the plot.

cor(airbnb_clean$price, airbnb_clean$number_of_reviews)
## [1] -0.04795423
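The Pearson coefficient above measures linear association and is sensitive to extreme values such as the maximum price of 10000. A rank-based alternative is Spearman's correlation, available through the `method` argument of cor(). The sketch below uses simulated data (all values and the single injected extreme price are assumptions for illustration) to show how one extreme value can weaken Pearson while leaving Spearman largely unchanged.

```r
set.seed(42)

# Simulated data: price falls as the number of reviews rises, plus noise.
reviews <- rpois(200, lambda = 20)
price   <- 300 - 5 * reviews + rnorm(200, sd = 20)

# Inject one extreme price, similar in spirit to the 10000 maximum.
price[1] <- 10000

pearson  <- cor(price, reviews)                       # linear, outlier-sensitive
spearman <- cor(price, reviews, method = "spearman")  # rank-based

c(pearson = pearson, spearman = spearman)
```

Running `cor(airbnb_clean$price, airbnb_clean$number_of_reviews, method = "spearman")` on the real data would show whether the weak relationship persists once ranks are used.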

Exercise 3: Explain what this relationship means in the context of the data.

— Although relatively weak, the relationship does show that, on average, the number of reviews tends to increase as price decreases.

— The variation in price clearly decreases as the number of reviews increases.

— The data includes all room types with different minimum nights, which helps explain the wide variation in price across the number of reviews. Some properties have a very low minimum number of nights, so for them price and number of reviews may be more strongly connected than for others.

Exercise 4: Describe the variability that you observe in the plot and how that corresponds to the strength you calculated in #2 above.

— Since strength is measured quantitatively, we use the correlation coefficient for the two numeric variables to understand the variability.

— The correlation coefficient (about -0.05) is negative and not at all close to 1 in magnitude, which indicates a very weak relationship between price and number of reviews. We can also see this from the fitted line in the plot.

— The most expensive objects have a small number of reviews (or 0).

— The variability of the data does not seem consistent across the number of reviews.

1 distinct pair of variables, where one variable is categorical and the other is numeric

Exercise 1: Create an appropriate plot to visualize the relationship between the two variables.

ggplot(airbnb_clean, aes(x = room_type, y = price)) + 
geom_boxplot(alpha = 0.5) + 
  labs(x = 'Room Type', y = 'Price') + 
  scale_y_log10()

Exercise 2: Describe the form, direction, and strength of the observed relationship. Include both qualitative and quantitative measures, as appropriate.

— It appears that median price is negatively related to room type, decreasing from Entire home/apt to Private room to Shared room.

— The relationship appears to be roughly linear across the room types on the log10 scale.

— The strength of the observed relationship is assessed qualitatively, and variability is fairly consistent across the room types.
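A quantitative companion to the boxplot is a per-group summary of centre and spread. The sketch below computes the median and IQR of price for each room type on a small hypothetical tibble (the prices are invented); the same pipeline could be applied to `airbnb_clean`.

```r
library(dplyr)

# Hypothetical toy data: two listings per room type.
toy <- tibble(
  room_type = factor(c("Entire home/apt", "Entire home/apt",
                       "Private room", "Private room",
                       "Shared room", "Shared room")),
  price = c(150, 250, 60, 90, 30, 50)
)

# Median and IQR of price within each room type, most expensive first.
by_room <- toy %>%
  group_by(room_type) %>%
  summarise(median_price = median(price), iqr_price = IQR(price)) %>%
  arrange(desc(median_price))

by_room
```

Comparing the group medians gives the direction of the relationship, and comparing the group IQRs gives a rough check on whether variability is similar across room types.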

Exercise 3: Explain what this relationship means in the context of the data.

— In the context of the data, the relationship shows that Entire home/apt listings are the most expensive and shared rooms are the least expensive; entire homes are priced higher than both private rooms and shared rooms.

Exercise 4: Describe the variability that you observe in the plot and how that corresponds to the strength you calculated in #2 above.

— Variability of the data seems fairly consistent across the room types.

— The distributions in the plot look roughly normal after applying log10 to price, which was needed because the price range is very large.

References:

https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data

California State University Course Material

http://insideairbnb.com/new-york-city/

https://lms.stclaircollege.ca/bbcswebdav/pid-1603997-dt-content-rid-12522028_1/courses/DAB501-19F-002/R4DS_Chapter_7c_soln.html

https://lms.stclaircollege.ca/bbcswebdav/pid-1600945-dt-content-rid-12506788_1/courses/DAB501-19F-002/R4DS_Chapter_7a_soln_revised.html

I, Kushal Patel, hereby state that I have not communicated with or gained information in any way from any person or resource that would violate the College’s academic integrity policies, and that all work presented is my own. In addition, I also agree not to share my work in any way, before or after submission, that would violate the College’s academic integrity policies.